Add YOLO26 object detection contrib model #151
Open
jimburtoft wants to merge 3 commits into aws-neuron:main from
Ultralytics YOLO26 (n/s/m/l/x) on Trainium2 via torch_neuronx.trace(). All 5 detection variants plus pose and OBB task heads compile and run with high accuracy (CosSim 0.987-0.997).

Peak throughput on trn2.3xlarge (LNC=1, DP=8):
- YOLO26s: 1,523 img/s (1.43x vs A10G compiled)
- YOLO26m: 1,267 img/s (2.67x vs A10G compiled)
- YOLO26l: 1,093 img/s (2.95x vs A10G compiled)
- YOLO26x: 876 img/s (4.49x vs A10G compiled)

Includes modeling module, 13 integration tests (all passing), Jupyter notebook, and README with benchmarks.
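The CosSim figures above compare the compiled model's output against a reference output. The PR's actual test code is not shown here, so the following is only a minimal sketch of how such a cosine-similarity check could be computed on flattened outputs. The function name and the sample values are hypothetical and are not taken from the PR.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero output")
    return dot / (norm_a * norm_b)


# Hypothetical flattened outputs: a CPU reference and a compiled-model result.
cpu_out = [0.10, 0.52, 0.31, 0.90]
neuron_out = [0.11, 0.50, 0.30, 0.92]
score = cosine_similarity(cpu_out, neuron_out)
```

A value close to 1.0 means the two outputs point in nearly the same direction. The metric ignores overall scale, so it is usually paired with a task-level check such as comparing decoded boxes.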
Tested all 4 combinations:
- trn2.3xlarge SDK 2.28: 13/13 pytest passed
- trn2.3xlarge SDK 2.29: 13/13 pytest passed
- inf2.xlarge SDK 2.28: 6/6 standalone tests passed
- inf2.xlarge SDK 2.29: 6/6 standalone tests passed

inf2 single-core throughput: yolo26n 60-70 img/s, yolo26s 64-77 img/s. Updated compatibility matrix and notebook prerequisites.
Summary
Ultralytics YOLO26 (n/s/m/l/x) on Trainium2 via `torch_neuronx.trace()`.

Validation
Validated on 4 configurations: trn2.3xlarge × {SDK 2.28, 2.29} and inf2.xlarge × {SDK 2.28, 2.29}.
Peak Throughput (trn2.3xlarge, LNC=1, DP=8)
Files
Key Design Decisions
- `torch_neuronx.trace()` (not NxDI model classes): YOLO26 is a CNN with no KV cache, no attention matrices, and no token generation. All variants fit on a single NeuronCore (<180 MB NEFF). Data Parallelism provides throughput scaling.
- `end2end=False`: `topk`/`sort` operations are not supported on Neuron (`NCC_EVRF029`). The model returns raw `[B, 84, 8400]` output, with postprocessing done on the CPU (~0.1 ms overhead).
- `NCC_IGCA030`). n/s use FP32.
- `--auto-cast` flags: `matmult` autocast produces NaN for Conv2d-dominant models.
- `--lnc 1` compiler flag is required when running in LNC=1 mode.

Target
`aws-neuron/neuronx-distributed-inference` `main` branch.
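
The `end2end=False` decision above moves the `topk`/`sort` step to the CPU. The PR's postprocessing code is not shown here, so the following is a minimal sketch of what that step could look like for one image. It assumes rows 0-3 of the raw output hold `(cx, cy, w, h)` and the remaining rows hold per-class scores. The function name, threshold, and `max_det` default are hypothetical, and any further filtering the real pipeline applies is not covered.

```python
def decode_raw_output(pred, conf_threshold=0.25, max_det=300):
    """Decode one image's raw head output into scored boxes.

    pred is a nested list of shape [4 + num_classes][num_anchors].
    Assumption: rows 0-3 are (cx, cy, w, h); rows 4+ are class scores.
    Returns (x1, y1, x2, y2, score, class_id) tuples, best score first.
    """
    num_rows = len(pred)
    num_anchors = len(pred[0])
    detections = []
    for i in range(num_anchors):
        # Pick the highest-scoring class for this anchor.
        best_cls, best_score = max(
            ((c - 4, pred[c][i]) for c in range(4, num_rows)),
            key=lambda pair: pair[1],
        )
        if best_score < conf_threshold:
            continue
        cx, cy, w, h = (pred[r][i] for r in range(4))
        # Convert center/size format to corner format.
        detections.append(
            (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, best_score, best_cls)
        )
    # The sort/top-k step that is not run on the device, done here on the CPU.
    detections.sort(key=lambda d: d[4], reverse=True)
    return detections[:max_det]


# Tiny hypothetical example: 2 classes, 3 anchors (the real output is 84 x 8400).
toy_pred = [
    [10, 50, 30],     # cx
    [10, 50, 30],     # cy
    [4, 20, 10],      # w
    [4, 10, 10],      # h
    [0.9, 0.1, 0.2],  # class 0 scores
    [0.05, 0.8, 0.1], # class 1 scores
]
toy_dets = decode_raw_output(toy_pred)
```

In practice this would be vectorized with tensor operations rather than Python loops. The loop form is used here only to keep each step visible.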